Information retrieval

Ahsan Ijaz

Information retreival: Applications

Document retrieval

  • Currently reading an article

Document retrieval

  • Currently reading an article
  • Goal: Want to find an article similar to that one

Document retrieval

  • How do we measure similarity?
  • How do we search over articles?

Retrieval as nearest neighbor search

  • Organize all articles by similarity of text

Compute distances with all other articles

  • Organize all articles by similarity of text

Retrieve 'nearest neighbor'

  • Organize all articles by similarity of text

Retrieve set of nearest neighbors

  • Organize all articles by similarity of text

Elements of NN search

  • Item (e.g., doc) representation \(x_q\)
  • Measure of distance between items : \[ \xi = \textit{distance}(x_i, x_q) \]

Document representation

Word count document representation

Bag of words model

  • Igonore order of words
  • Count # of instances of each word in vocabulary

Sample sentence with matrix:

Issues with word counts - Rare words

  • Common words in doc: "the", "player", "field",...
  • Dominate rare words like: "cricket", "Shane Watson"

TF-IDF document representation

  • Emphasizes important words

    • Appears frequently in document (common locally)
    • Term frequency = Word Count
  • Appears rarely in corpus (rare globally)

\[ \textit{Inverse doc frequency} = \log\frac{\textit{#docs}}{1 + \textit{#docs using the word}} \]

  • tf*idf

IDF function

plot of chunk unnamed-chunk-1

Distance metrics

Notion of closeness?

  • Euclidean distance:

\[ \textit{distance}(x_i,x_q) = |x_i - x_q| \]

Weighting different features

  • Some features might be more important than others
  • Weighing features in distance metrics

Scaled euclidean distance

  • Weight on each feature
  • Setting weights as zero or one is same as feature selection

\[ \textit{distance}(x_i,x_q) = \sqrt{(a_1(x_i[1]-x_q[1])^2+\ldots+a_d(x_i[d]-x_q[d])^2))} \]

Inner product similarity

Inner product similarity

Cosine similarity (Normalized inner product)

\[ \frac{\mathbf{x}_i^{T}\mathbf{x}_q}{\|x_i\|\|x_q\|} \]

Unnormalized case

Normalized case

Undesired case of normalization